63 research outputs found

    On Weighted Multivariate Sign Functions

    Multivariate sign functions are often used for robust estimation and inference. We propose using data-dependent weights in association with such functions. The proposed weighted sign functions retain desirable robustness properties, while significantly improving efficiency in estimation and inference compared to unweighted multivariate sign-based methods. Using weighted signs, we demonstrate methods of robust location estimation and robust principal component analysis. We extend the scope of robust multivariate methods to include robust sufficient dimension reduction and functional outlier detection. Several numerical studies and real data applications demonstrate the efficacy of the proposed methodology. Comment: Keywords: Multivariate sign, Principal component analysis, Data depth, Sufficient dimension reduction

    Generalized bootstrap for estimating equations

    We introduce a generalized bootstrap technique for estimators obtained by solving estimating equations. Some special cases of this generalized bootstrap are the classical bootstrap of Efron, the delete-d jackknife and variations of the Bayesian bootstrap. The use of the proposed technique is discussed in some examples. Distributional consistency of the method is established and an asymptotic representation of the resampling variance estimator is obtained. Comment: Published at http://dx.doi.org/10.1214/009053604000000904 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
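
One way to see how the special cases named in this abstract fit a common scheme is to treat each resample as a vector of random weights attached to the terms of the estimating equation. The sketch below does this for the simplest case, the sample mean, where the weighted equation sum_i w_i (x_i - theta) = 0 has a closed-form solution. All function names are hypothetical, and the weight schemes shown are the textbook versions (multinomial counts for Efron's bootstrap, normalized exponentials for the Bayesian bootstrap, zero weights on d dropped points for the delete-d jackknife); the paper's general framework and any scaling needed for variance estimation are not reproduced here.

```python
import random


def weighted_mean_solution(xs, ws):
    """Solve sum_i w_i * (x_i - theta) = 0 for theta."""
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)


def efron_weights(n, rng):
    """Multinomial counts: how often each index is drawn in n draws with replacement."""
    counts = [0] * n
    for _ in range(n):
        counts[rng.randrange(n)] += 1
    return counts


def bayesian_bootstrap_weights(n, rng):
    """Dirichlet(1, ..., 1) weights, generated as normalized unit exponentials."""
    g = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(g)
    return [gi / total for gi in g]


def delete_d_weights(n, d, rng):
    """Weight 0 on d randomly dropped observations and 1 on the rest."""
    dropped = set(rng.sample(range(n), d))
    return [0.0 if i in dropped else 1.0 for i in range(n)]


def resample_estimates(xs, weight_fn, reps, rng):
    """Re-solve the weighted estimating equation for reps independent weight vectors."""
    return [weighted_mean_solution(xs, weight_fn(len(xs), rng)) for _ in range(reps)]


if __name__ == "__main__":
    rng = random.Random(1)
    data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
    schemes = {
        "efron": efron_weights,
        "bayesian": bayesian_bootstrap_weights,
        "delete-2": lambda n, r: delete_d_weights(n, 2, r),
    }
    for name, fn in schemes.items():
        print(name, resample_estimates(data, fn, 3, rng))
```

Only the weight generator changes between schemes; the solver for the estimating equation is shared, which is the structural point the abstract makes.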

    Parametric bootstrap approximation to the distribution of EBLUP and related prediction intervals in linear mixed models

    The empirical best linear unbiased prediction (EBLUP) method uses a linear mixed model to combine information from different sources. This method is particularly useful in small area problems. The variability of an EBLUP is traditionally measured by the mean squared prediction error (MSPE), and interval estimates are generally constructed using estimates of the MSPE. Such methods have shortcomings such as under-coverage or over-coverage, excessive length and lack of interpretability. We propose a parametric bootstrap approach to estimate the entire distribution of a suitably centered and scaled EBLUP. The bootstrap histogram is highly accurate, and differs from the true EBLUP distribution by only $O(d^3 n^{-3/2})$, where $d$ is the number of parameters and $n$ the number of observations. This result is used to obtain highly accurate prediction intervals. Simulation results demonstrate the superiority of this method over existing techniques of constructing prediction intervals in linear mixed models. Comment: Published at http://dx.doi.org/10.1214/07-AOS512 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Feature Selection using e-values

    In the context of supervised parametric models, we introduce the concept of e-values. An e-value is a scalar quantity that represents the proximity of the sampling distribution of parameter estimates in a model trained on a subset of features to that of the model trained on all features (i.e. the full model). Under general conditions, a rank ordering of e-values separates models that contain all essential features from those that do not. The e-values are applicable to a wide range of parametric models. We use data depths and a fast resampling-based algorithm to implement a feature selection procedure using e-values, providing consistency results. For a $p$-dimensional feature space, this procedure requires fitting only the full model and evaluating $p+1$ models, as opposed to the traditional requirement of fitting and evaluating $2^p$ models. Through experiments across several model settings and synthetic and real datasets, we establish the e-values method as a promising general alternative to existing model-specific methods of feature selection. Comment: accepted in ICML-202

    Simultaneous Selection of Multiple Important Single Nucleotide Polymorphisms in Familial Genome Wide Association Studies Data

    We propose a resampling-based fast variable selection technique for selecting important Single Nucleotide Polymorphisms (SNPs) in multi-marker mixed effect models used in twin studies. Due to computational complexity, current practice includes testing the effect of one SNP at a time, commonly termed 'single SNP association analysis'. Joint modeling of genetic variants within a gene or pathway may have better power to detect the relevant genetic variants, hence we adapt our recently proposed framework of e-values to address this. In this paper, we propose a computationally efficient approach for single SNP detection in families while utilizing information on multiple SNPs simultaneously. We achieve this through improvements in two aspects. First, unlike other model selection techniques, our method only requires training a model with all possible predictors. Second, we utilize a fast and scalable bootstrap procedure that only requires Monte-Carlo sampling to obtain bootstrapped copies of the estimated vector of coefficients. Using this bootstrap sample, we obtain the e-value for each SNP, and select SNPs having e-values below a threshold. We illustrate through numerical studies that our method is more effective in detecting SNPs associated with a trait than either single-marker analysis using family data or model selection methods that ignore the familial dependency structure. We also use the e-values to perform gene-level analysis in nuclear families and detect several SNPs that have been implicated in alcohol consumption.

    Distribution-free cumulative sum control charts using bootstrap-based control limits

    This paper deals with phase II, univariate, statistical process control when a set of in-control data is available, and when both the in-control and out-of-control distributions of the process are unknown. Existing process control techniques typically require substantial knowledge about the in-control and out-of-control distributions of the process, which is often difficult to obtain in practice. We propose (a) using a sequence of control limits for the cumulative sum (CUSUM) control charts, where the control limits are determined by the conditional distribution of the CUSUM statistic given the last time it was zero, and (b) estimating the control limits by bootstrap. Traditionally, the CUSUM control chart uses a single control limit, which is obtained under the assumption that the in-control and out-of-control distributions of the process are normal. When the normality assumption is not valid, which is often true in applications, the actual in-control average run length, defined to be the expected time duration before the control chart signals a process change, is quite different from the nominal in-control average run length. This limitation is mostly eliminated in the proposed procedure, which is distribution-free and robust against different choices of the in-control and out-of-control distributions. Comment: Published at http://dx.doi.org/10.1214/08-AOAS197 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
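
The two ingredients described in this abstract can be sketched loosely as follows: the standard upper CUSUM recursion C_n = max(0, C_{n-1} + x_n - k), tracked together with the time elapsed since the statistic was last zero, and a bootstrap over the in-control data that collects CUSUM values separately for each such elapsed time and takes an upper quantile as the limit. This is a simplified sketch under stated assumptions, not the paper's algorithm: the allowance `k`, the level `alpha`, the cap `max_sprint`, and every function name are hypothetical, and the way the paper calibrates its limits to a target average run length is not reproduced.

```python
import random


def cusum_with_sprint(xs, k):
    """Upper CUSUM values paired with the number of steps since the statistic was last zero."""
    c, t = 0.0, 0
    out = []
    for x in xs:
        c = max(0.0, c + x - k)
        t = t + 1 if c > 0 else 0
        out.append((c, t))
    return out


def bootstrap_limits(in_control, k, max_sprint, alpha, reps, rng):
    """Estimate one control limit per elapsed time j = 1..max_sprint (simplified sketch)."""
    by_sprint = {j: [] for j in range(1, max_sprint + 1)}
    for _ in range(reps):
        # Resample the in-control data with replacement to simulate an in-control path.
        path = [rng.choice(in_control) for _ in range(len(in_control))]
        for c, t in cusum_with_sprint(path, k):
            if 1 <= t <= max_sprint:
                by_sprint[t].append(c)
    limits = {}
    for j, values in by_sprint.items():
        if values:
            values.sort()
            # Empirical (1 - alpha) quantile of the CUSUM given elapsed time j.
            idx = min(len(values) - 1, int((1 - alpha) * len(values)))
            limits[j] = values[idx]
    return limits


if __name__ == "__main__":
    rng = random.Random(7)
    in_control = [rng.gauss(0.0, 1.0) for _ in range(200)]
    print(bootstrap_limits(in_control, k=0.25, max_sprint=5, alpha=0.05, reps=200, rng=rng))
```

In monitoring, a new observation would then be compared against the limit matching its current elapsed time rather than against a single fixed threshold.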